Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sycl-post-link] Do not save all split modules in memory #5021

Conversation

mlychkov
Copy link
Contributor

Now all split modules are first saved in memory and then processed one by one.
Each split module may require tens of megabytes in memory and total amount of
split modules may be significant. It all may lead to overflow of RAM memory.

Rewrite module splitting in a way that each split module is processed and
immediately saved into a file without temporary collecting all modules in RAM.

Signed-off-by: Mikhail Lychkov mikhail.lychkov@intel.com

Now all split modules are first saved in memory and then processed one by one.
Each split module may require tens of megabytes in memory and total amount of
split modules may be significant. It all may lead to overflow of RAM memory.

Rewrite module splitting in a way that each split module is processed and
immediately saved into a file without temporary collecting all modules in RAM.

Signed-off-by: Mikhail Lychkov <mikhail.lychkov@intel.com>
EntryPointsGroupScope Scope = selectDeviceCodeGroupScope(*M);
groupEntryPoints(*M, GMap, Scope);
}
EntryPointsGroupScope Scope = selectDeviceCodeGroupScope(*M);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: global scope is equivalent to "no split", so the DoSplit/IsSplit parameter/field seems redundant and somewhat cofusing (affects split logic in two places - scope selection and splitter constructor here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this simplification.
However, there is a difference in output of sycl-post-link tool between no split and split in global scope now. Without split option input module is saved almost without modifications (only spec constants update). But if module is split in global scope then only entry points with their dependencies are left and all other device functions are dropped.
I'm ok with setting scope to global if split option is not specified if it won't impact users. But I'd commit this change in a separate PR to leave ability to simply revert that change in case if it will introduce regression.

@bader
Copy link
Contributor

bader commented Nov 25, 2021

/summary:run

@bader bader merged commit 6266820 into intel:sycl Nov 26, 2021
@mlychkov mlychkov deleted the private/mlychkov/sycl_post_link_1mod_refactoring_v2_2 branch November 26, 2021 07:00
AlexeySachkov added a commit that referenced this pull request Apr 28, 2023
)

#### Intro

This is a refactoring of how we perform device code split in
`sycl-post-link`, which is intended to solve several existing issues
with the current implementation:
1. increased peak RAM consumption by `sycl-post-link`
2. bad scaling with more and more split "dimensions" being added
3. increased tests maintenance cost due to non-deterministic order
(between commits) of output files produced by `sycl-post-link`

#### A bit more context about the issues above:

(1) Increase peak RAM consumption is caused by the fact that we
currently preserve **all** splits in-memory, even though we can process
them on-by-one and discard them as soon as we stored them to a disk.
This was implemented as a memory consumption optimization in #5021, but
it got accidentally reverted in #7302 as an attempt to workaround (2).

(2) is pretty much summarized in our source code:

https://github.com/intel/llvm/blob/afebb2543ccecb89f83c84b68fba7616bbab89ac/llvm/tools/sycl-post-link/sycl-post-link.cpp#L806-L811

(3) is caused by a bad implementation decision made in #7302: because
every split is now identified by a hash, every time you add a new split
"dimension"/new feature to an account, it results in different hashes
for existing tests. Just look how many unrelated tests had to be updated
in #7512, #8056 and #8167

#### Now to the PR itself:

It introduces a new infrastructure for categorizing/grouping kernel
functions: instead of using hashes, we now build a string description
for each kernel function and then group kernels with the same
description string together.

String description is built by a new entity: it accepts a set of rules,
where each rule is a simple function which returns a string for passed
`llvm::Function`. Results of all rules are concatenated together and
rules are invoked in a stable order of their registration.

There is a simple API for building those rules. It provides some
predefined rules for the most popular use cases like turning a function
attribute or a metadata into a string descriptor for the function. There
is also a possibility to pass a custom callback there to implement more
complicated logic.

#### How does this PR help with issues above?

(1) and (2) are fixed in conjunction: `sycl-post-link` was refactored to
avoid storing more than one split module at a time and that is possible
because the PR unifies per-scope and optional-kernel-features splitters
into a single generic splitter. The new API for kernels categorization
seems to be flexible enough to provide that infrastructure so merged
splitters still look OK code-wise.

(3) is caused by using string identifiers instead of hashes as well as
by using a data structure which sorts identifiers.

#### Any other benefits from this PR?

About 50 lines of code less to support :)

Extending device code split for more optional features would be even
easier than it is now: instead of adding several changes to various
places around `UsedOptionalFeatures` structure, it will be just adding a
1-3 lines of code. Please also note that `UsedOptionalFeatures` contains
tons of inconsistencies in its implementation, which will all gone with
this PR: in `operator==` we don't use hash and instead compare certain
fields directly (and we do miss some of them); `generateModuleName`
method skips some of optional features and ignores them.

Cross-module `device_global` usages checks should now work at all split
dimensions (except for ESIMD).

#### Any potential downsides?

With current `UsedOptionalFeatures` there is a possibility to embed
various information (used aspects, `large-grf` flag, etc.) directly
during device code split to avoid re-gathering that information later
when we generate properties. With the suggested approach, it would be
harder to do, because it doesn't seem to naturally fit to the proposed
infrastructure: see changes I did around `large-grf` in this PR.

However, we have never actually implemented this and re-querying some
metadata from function doesn't seem like a bottleneck, so it should
really be a very minor and only theoretical downside.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants